
refactor: building within_intron and spliced subsets #134

Open
ns-rse wants to merge 2 commits into main from ns-rse/91-conditional-extraction
Conversation

ns-rse
Contributor

@ns-rse ns-rse commented Jan 10, 2025

Closes #91
Closes #92

Extracts the duplicated code that builds subsets of reads that are within introns and within spliced_3ui, and abstracts it out into functions with tests. Currently the tests fail and I don't think they should, so I need to work out why, and perhaps add some additional fixtures to use here.

@ns-rse ns-rse marked this pull request as draft January 10, 2025 17:13
@ns-rse ns-rse added tests Testing issues refactor Refactoring labels Jan 10, 2025
@ns-rse
Contributor Author

ns-rse commented Jan 13, 2025

Hi @IanSudbery,

I've not yet sussed out why the tests fail, but I have been looking closely at some of the code and logic and have some questions I was wondering if you could help with.

In this PR I've noted...

            # Why add an empty list and append a tuple?
            # Should we not be getting the keys of within_intron to check transcript_id is not in?
            if transcript_id not in within_intron:
                within_intron[transcript_id] = []
            within_intron[transcript_id].append((chromosome, start, end, strand))

...and thinking it through I realised that the reason for this might be that more than one transcript_id can come out of a given pair_feature. If that is the case, the resulting value for a given transcript_id is a list of tuples. The following dummy example demonstrates this:

first = (1, 2, 3, 4)
second = ("a", "b", "c", "d")

test = {}
test["first"] = []
test["first"].append(first)
test["first"].append(second)
test["first"]
# [(1, 2, 3, 4), ('a', 'b', 'c', 'd')]
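As an aside on the check-then-append pattern quoted above, a `collections.defaultdict` could express the same thing without the membership test. This is just a sketch of a possible simplification (with made-up intervals standing in for the real tuples), not the current code:

```python
from collections import defaultdict

# Hypothetical intervals standing in for (chromosome, start, end, strand)
first = ("chr1", 100, 200, "+")
second = ("chr1", 150, 250, "+")

within_intron = defaultdict(list)
for transcript_id, interval in [("tx1", first), ("tx1", second)]:
    # Missing keys are created as empty lists automatically, so no
    # `if transcript_id not in within_intron` check is needed.
    within_intron[transcript_id].append(interval)

print(dict(within_intron))
# {'tx1': [('chr1', 100, 200, '+'), ('chr1', 150, 250, '+')]}
```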

Reading through all_introns_counts_and_info.py I looked for where read1_within_intron / read2_within_intron are
subsequently used and there are two places.

Line 239 - all_dicts

This makes a list of four dictionaries

            all_dicts = [read1_within_intron, read2_within_intron, read1_spliced_3UI, read2_spliced_3UI]

...but all_dicts is never used anywhere (not a problem for now, but we can remove it during refactoring).

Lines 246-259 - assign_conversions_to_retained

The only other place that read1_within_intron / read2_within_intron are used is in building a set of assign_conversions_to_retained on lines 246-259:

            # Create a set of tuples: (tx_id,(start,end))
            # Retained
            assign_conversions_to_retained = []

            for transcript_id, start_end_list in read1_within_intron.items():
                for start_end in start_end_list:
                    assign_conversions_to_retained.append((transcript_id, start_end))

            for transcript_id, start_end_list in read2_within_intron.items():
                for start_end in start_end_list:
                    assign_conversions_to_retained.append((transcript_id, start_end))

            # Only need each unique element once
            assign_conversions_to_retained = set(assign_conversions_to_retained)

The for ... loop here uses .items(), so transcript_id is the key and start_end_list is a list, commonly of length 1 but possibly longer; these are then collected into a list before being de-duplicated.

I'm wondering if we can build the set more directly, so I wanted to check that I've understood the intention correctly: is it to get a set of unique transcript_id values and their associated (chromosome, start, end, strand) tuples?
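If that understanding is right, one way to build the set directly would be a single set comprehension over both dictionaries. A sketch, using toy dictionaries in place of the real read1_within_intron / read2_within_intron:

```python
# Toy dictionaries mimicking read1_within_intron / read2_within_intron;
# note the duplicated interval under "tx1", which the set collapses.
read1_within_intron = {"tx1": [("chr1", 100, 200, "+")]}
read2_within_intron = {"tx1": [("chr1", 100, 200, "+")],
                       "tx2": [("chr1", 300, 400, "+")]}

# Build the set directly, no intermediate list needed
assign_conversions_to_retained = {
    (transcript_id, start_end)
    for d in (read1_within_intron, read2_within_intron)
    for transcript_id, start_end_list in d.items()
    for start_end in start_end_list
}
print(assign_conversions_to_retained)
# {('tx1', ('chr1', 100, 200, '+')), ('tx2', ('chr1', 300, 400, '+'))}
```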

Closes #91
Closes #92

Extracts the duplicated code that builds subsets of reads that are within introns (or span introns) and within
spliced_3ui and abstracts out to functions with tests. Currently the tests fail and I don't think they should so I need
to work out why and perhaps add some additional fixtures to use here.
@ns-rse ns-rse force-pushed the ns-rse/91-conditional-extraction branch from 743b339 to ce1c18d Compare January 13, 2025 13:38
@IanSudbery
Collaborator

Okay, this is all a bit confusing, but...

The XT tag on a read gives the gene_id of the gene to which the read is mapped. A gene can have multiple transcripts.

utrons1/2 contain a list of all the introns associated with the gene in gene_id. If two transcripts contain the same intron, it will appear in those lists twice. Consider the following situation:

                                                 Genome ----------------------->

Read Alignment Blocks                     |>>>>|                     |>>>>>>>>>>>>>>>|
Transcript 1:                      |===========|---------------------|=========|-------------------|==========|
Transcript 2:              |====|------------------------------------|=========|-------------------|==========|

Introns[0]                                      ---------------------
Introns[1]                                                                      -------------------
Introns[2]                       ------------------------------------
Introns[3]                                                                      -------------------

Exon : |========|   Read alignment block:  |>>>>>|

Note that the introns list (which could be utrons1 or utrons2, but is likely to be the same for both) contains four introns.

The loop goes through these lists and for each intron tests whether the current read supports either the splicing or retention of each intron.
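A very rough sketch of how such a per-intron loop could classify a read, purely for illustration; the actual criteria in all_introns_counts_and_info.py may well differ (e.g. tolerance around junction coordinates), and all names here are made up:

```python
# Hypothetical sketch of the classification loop described above.
def classify_read(blocks, introns):
    """blocks: read alignment blocks as sorted (start, end) tuples.
    introns: (transcript_id, intron_start, intron_end) tuples."""
    within_intron, spliced = {}, {}
    for transcript_id, istart, iend in introns:
        # Splicing: the gap between two consecutive blocks matches the intron.
        for (_, b1_end), (b2_start, _) in zip(blocks, blocks[1:]):
            if b1_end == istart and b2_start == iend:
                spliced.setdefault(transcript_id, []).append((istart, iend))
        # Retention: an alignment block lies wholly inside the intron.
        for bstart, bend in blocks:
            if istart <= bstart and bend <= iend:
                within_intron.setdefault(transcript_id, []).append((istart, iend))
    return within_intron, spliced

blocks = [(100, 150), (200, 250), (300, 320)]
introns = [("tx1", 150, 200), ("tx1", 280, 360)]
retained, spliced = classify_read(blocks, introns)
# retained -> {'tx1': [(280, 360)]}, spliced -> {'tx1': [(150, 200)]}
```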

In this case, the final dictionaries will look like:

within_intron = {"transcript1": [introns[1]], "transcript2": [introns[3]] } # read supports retention of introns[1] and introns[3] 
read1_spliced_3UI = {"transcript1": [Introns[0]]}  #  read supports splicing of introns[0]

Note that in theory a read could support the retention (or splicing) of multiple introns within one transcript (although I think it's probably unlikely), so you could get within_intron = {"transcript1": [intron[1], intron[4]]}

In the end, we want data per intron, not per transcript, so where two transcripts contain the same intron, they are eventually collapsed onto the same record (i.e. here we have both introns[1] and introns[3], but we really only need that interval once).
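To illustrate that collapsing step with hypothetical intervals: if the final grouping should be per intron rather than per transcript, keying a set on the interval alone drops the transcript distinction, so the shared interval appears once:

```python
# Hypothetical case: the same genomic interval recorded under two transcripts,
# as with introns[1] and introns[3] in the diagram above.
within_intron = {
    "transcript1": [("chr1", 500, 600, "+")],
    "transcript2": [("chr1", 500, 600, "+")],
}

# Keying only on the interval collapses the two records into one.
unique_intron_intervals = {
    interval
    for intervals in within_intron.values()
    for interval in intervals
}
print(unique_intron_intervals)
# {('chr1', 500, 600, '+')}
```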

@ns-rse
Contributor Author

ns-rse commented Jan 13, 2025

Great, thanks for that explanation @IanSudbery, really useful. I will see about getting this into the documentation (noted in #136).

I've worked out some transcripts that are more useful for the tests in this Pull Request and will be updating the Pull Request in a bit (just finishing them off).

@ns-rse ns-rse marked this pull request as ready for review January 13, 2025 17:25
@ns-rse ns-rse requested review from IanSudbery and jjriley1 January 13, 2025 18:10
Identified instances where meaningful results are obtained when filtering for instances where the start/end are within
introns or there is splicing.

Tests are not fully comprehensive yet, as not all logic statements are covered, in particular when a transcript spans a whole region; I'm currently a bit hazy on whether this will always happen, as the `query_length` is always `150` (`read_length` is longer and was the cause of previous errors).
@ns-rse ns-rse force-pushed the ns-rse/91-conditional-extraction branch from 5f9fed7 to 552d5f3 Compare January 13, 2025 18:18
@ns-rse ns-rse changed the title refactor(wip): building within_intron and spliced subsets refactor: building within_intron and spliced subsets Jan 13, 2025
@ns-rse
Contributor Author

ns-rse commented Jan 13, 2025

Identified instances where meaningful results are obtained when filtering for instances where the start/end are within introns or there is splicing.

Tests are not fully comprehensive yet, as not all logic statements are covered, in particular when a transcript spans a whole region; I'm currently a bit hazy on whether this will always happen, as the query_length is always 150 (read_length is longer and was the cause of previous errors in regression tests).
